CUSTOMER CHURN PREDICTION¶
Loading libraries¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split,cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, recall_score, confusion_matrix, precision_score
from sklearn.metrics import f1_score, classification_report, r2_score, auc
Loading the dataset¶
In [3]:
df=pd.read_csv("Telco_Customer_Churn.csv")
copied_df=df.copy()
Understanding the data¶
In [4]:
pd.options.display.max_columns=21
df.head()
Out[4]:
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
checking size of the data¶
In [5]:
df.shape
Out[5]:
(7043, 21)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
In [7]:
df.columns.values
Out[7]:
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
'TotalCharges', 'Churn'], dtype=object)
In [8]:
df.columns=df.columns.str.lower()
df.columns.values
Out[8]:
array(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
'tenure', 'phoneservice', 'multiplelines', 'internetservice',
'onlinesecurity', 'onlinebackup', 'deviceprotection',
'techsupport', 'streamingtv', 'streamingmovies', 'contract',
'paperlessbilling', 'paymentmethod', 'monthlycharges',
'totalcharges', 'churn'], dtype=object)
In [9]:
df.dtypes
Out[9]:
customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object
visualizing missing values¶
In [10]:
msno.matrix(df)
Out[10]:
<Axes: >
the visualization above shows no null values¶
- however, totalcharges is stored as object dtype and contains 11 blank strings, which become NaN once the column is converted to numeric
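One way to confirm blank entries before any conversion is to count whitespace-only strings directly. The sketch below uses a small made-up frame (the values are illustrative, not rows from the dataset) to show why a null check reports nothing missing while `pd.to_numeric(..., errors='coerce')` later produces NaN.

```python
import pandas as pd

# Toy stand-in for the real data (assumed values): totalcharges is read
# as text, and one entry is a single blank space rather than a number.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "totalcharges": ["29.85", " ", "1889.5"],
})

# isnull() sees a string, so nothing is reported as missing.
nulls_before = int(toy["totalcharges"].isnull().sum())

# Count entries that are empty or whitespace-only.
blank_count = int(toy["totalcharges"].str.strip().eq("").sum())

# Coercing to numeric turns the blank string into NaN.
converted = pd.to_numeric(toy["totalcharges"], errors="coerce")
nulls_after = int(converted.isnull().sum())

print(nulls_before, blank_count, nulls_after)
```

On the real frame the same `str.strip().eq("")` check would be applied to `df["totalcharges"]` while it is still of object dtype.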
Dropping the customerid¶
In [11]:
df.drop(['customerid'],axis=1,inplace=True)
df.head()
Out[11]:
| | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
In [12]:
df['totalcharges']=pd.to_numeric(df['totalcharges'],errors='coerce')
df.isnull().sum()
Out[12]:
gender               0
seniorcitizen        0
partner              0
dependents           0
tenure               0
phoneservice         0
multiplelines        0
internetservice      0
onlinesecurity       0
onlinebackup         0
deviceprotection     0
techsupport          0
streamingtv          0
streamingmovies      0
contract             0
paperlessbilling     0
paymentmethod        0
monthlycharges       0
totalcharges        11
churn                0
dtype: int64
In [13]:
df[(df['totalcharges'].isna())]
Out[13]:
| | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 488 | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | No | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | NaN | No |
| 753 | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | NaN | No |
| 936 | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | Yes | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 | NaN | No |
| 1082 | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | NaN | No |
| 1340 | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | Yes | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 | NaN | No |
| 3331 | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | NaN | No |
| 3826 | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.35 | NaN | No |
| 4380 | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.00 | NaN | No |
| 5218 | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | NaN | No |
| 6670 | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | Yes | Yes | Yes | Yes | No | Two year | No | Mailed check | 73.35 | NaN | No |
| 6754 | Male | 0 | No | Yes | 0 | Yes | Yes | DSL | Yes | Yes | No | Yes | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | NaN | No |
In [14]:
# blank_spaces=df.applymap(lambda x:x ==' ')
# blank_spaces.sum()
checking for blank spaces¶
checking rows where tenure is 0¶
In [15]:
df[df['tenure']==0]
Out[15]:
| | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 488 | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | No | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | NaN | No |
| 753 | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | NaN | No |
| 936 | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | Yes | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 | NaN | No |
| 1082 | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | NaN | No |
| 1340 | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | Yes | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 | NaN | No |
| 3331 | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | NaN | No |
| 3826 | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.35 | NaN | No |
| 4380 | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.00 | NaN | No |
| 5218 | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | NaN | No |
| 6670 | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | Yes | Yes | Yes | Yes | No | Two year | No | Mailed check | 73.35 | NaN | No |
| 6754 | Male | 0 | No | Yes | 0 | Yes | Yes | DSL | Yes | Yes | No | Yes | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | NaN | No |
In [16]:
print(df[df['tenure']==0].count())
gender              11
seniorcitizen       11
partner             11
dependents          11
tenure              11
phoneservice        11
multiplelines       11
internetservice     11
onlinesecurity      11
onlinebackup        11
deviceprotection    11
techsupport         11
streamingtv         11
streamingmovies     11
contract            11
paperlessbilling    11
paymentmethod       11
monthlycharges      11
totalcharges         0
churn               11
dtype: int64
- only 11 rows have a tenure of 0, and they are exactly the rows with missing totalcharges, so they can be dropped
In [17]:
df.drop(labels=df[df['tenure']==0].index,axis=0,inplace=True)
df[df['tenure']==0].index
Out[17]:
Index([], dtype='int64')
In [18]:
df.isnull().sum()
Out[18]:
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64
- the null values in the totalcharges column were dropped along with the tenure 0 rows
brief descriptive summary¶
In [19]:
df.describe()
Out[19]:
| | seniorcitizen | tenure | monthlycharges | totalcharges |
|---|---|---|---|---|
| count | 7032.000000 | 7032.000000 | 7032.000000 | 7032.000000 |
| mean | 0.162400 | 32.421786 | 64.798208 | 2283.300441 |
| std | 0.368844 | 24.545260 | 30.085974 | 2266.771362 |
| min | 0.000000 | 1.000000 | 18.250000 | 18.800000 |
| 25% | 0.000000 | 9.000000 | 35.587500 | 401.450000 |
| 50% | 0.000000 | 29.000000 | 70.350000 | 1397.475000 |
| 75% | 0.000000 | 55.000000 | 89.862500 | 3794.737500 |
| max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 |
In [20]:
df['seniorcitizen']=df.seniorcitizen.replace({0:'No',1:'Yes'})
df.head()
Out[20]:
| | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
In [21]:
df.describe()
Out[21]:
| | tenure | monthlycharges | totalcharges |
|---|---|---|---|
| count | 7032.000000 | 7032.000000 | 7032.000000 |
| mean | 32.421786 | 64.798208 | 2283.300441 |
| std | 24.545260 | 30.085974 | 2266.771362 |
| min | 1.000000 | 18.250000 | 18.800000 |
| 25% | 9.000000 | 35.587500 | 401.450000 |
| 50% | 29.000000 | 70.350000 | 1397.475000 |
| 75% | 55.000000 | 89.862500 | 3794.737500 |
| max | 72.000000 | 118.750000 | 8684.800000 |
In [22]:
df.describe(include =['object'])
Out[22]:
| | gender | seniorcitizen | partner | dependents | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 | 7032 |
| unique | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 4 | 2 |
| top | Male | No | No | No | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | No |
| freq | 3549 | 5890 | 3639 | 4933 | 6352 | 3385 | 3096 | 3497 | 3087 | 3094 | 3472 | 2809 | 2781 | 3875 | 4168 | 2365 | 5163 |
In [23]:
fig,ax=plt.subplots(1,2,figsize=(10,10))
gender_count=df.gender.value_counts()
churn_count=df.churn.value_counts()
# take the labels from each value_counts index so every wedge carries its own category name
ax[0].pie(gender_count,autopct='%0.1f%%',labels=gender_count.index,startangle=90, shadow=True, wedgeprops={'width':0.6})
ax[0].set_title('Gender')
ax[1].pie(churn_count,autopct='%0.1f%%',labels=churn_count.index,startangle=90, shadow=True, wedgeprops={'width':0.6})
ax[1].set_title('Churn')
plt.show()
- 26.6% of customers switched to another firm.
- 49.5% of customers are female and 50.5% are male.
In [24]:
df["churn"][df["churn"]=="No"].groupby(by=df["gender"]).count()
Out[24]:
gender
Female    2544
Male      2619
Name: churn, dtype: int64
In [25]:
churn_no_count=df["churn"][df["churn"]=="No"].groupby(by=df["gender"]).count().sum()
churn_no_count
Out[25]:
5163
In [26]:
df["churn"][df["churn"]=="Yes"].groupby(by=df.gender).count()
Out[26]:
gender
Female    939
Male      930
Name: churn, dtype: int64
In [27]:
churn_yes_count=df["churn"][df["churn"]=="Yes"].groupby(by=df.gender).count().sum()
churn_yes_count
Out[27]:
1869
In [28]:
plt.figure(figsize=(4,4))
labels =["Churn: Yes","Churn:No"]
values = [1869,5163]
labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]
colors = ['#ff6666', '#66b3ff']
# colors=['pink','lightblue']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
explode = (0.3,0.3)
explode_gender = (0.1,0.1,0.1,0.1)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct='%1.1f%%',pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90,frame=True, explode=explode,radius=10, textprops =textprops, counterclock = True, )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Churn Distribution w.r.t Gender: Male(M), Female(F)', fontsize=15, y=1.1)
# show plot
plt.axis('equal')
plt.tight_layout()
plt.show()
- There is negligible difference in customer percentage who changed the service provider. Both genders behaved in similar way when it comes to migrating to another service provider.
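To put a number on "negligible", the per-gender churn rate can be worked out from the groupby counts shown earlier. This is only a small arithmetic sketch; the dictionary names are illustrative.

```python
# Counts copied from the groupby outputs above (Out[24] and Out[26]).
churned = {"Female": 939, "Male": 930}
stayed = {"Female": 2544, "Male": 2619}

# Churn rate = churned / (churned + stayed), computed per gender.
churn_rate = {
    gender: churned[gender] / (churned[gender] + stayed[gender])
    for gender in churned
}

for gender, rate in churn_rate.items():
    print(f"{gender}: {rate:.1%}")
```

The two rates come out within about one percentage point of each other (roughly 27.0% for female and 26.2% for male customers).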
In [29]:
fig=px.histogram(df, x=df.contract,color='contract',title='<b>Customer contract distribution</b>')
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
In [30]:
fig=px.histogram(df, x=df.churn,color='contract',barmode='group',title='<b>Customer churn within contract distribution</b>')
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
what is the churn percentage for each contract type?¶
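One possible way to answer this is a row-normalised cross-tabulation. The sketch below runs on a tiny made-up frame (the rows are illustrative assumptions, not the dataset), so the percentages it prints say nothing about the real customers.

```python
import pandas as pd

# Illustrative toy data; the real notebook would use df instead.
toy = pd.DataFrame({
    "contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "churn": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# normalize="index" makes each row (contract type) sum to 1,
# so multiplying by 100 gives the churn percentage per contract.
churn_pct = pd.crosstab(toy["contract"], toy["churn"], normalize="index") * 100
print(churn_pct.round(1))
```

On the notebook's data the equivalent call would be `pd.crosstab(df["contract"], df["churn"], normalize="index")`.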
In [31]:
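# unique() and value_counts() order the categories differently,
# so take labels and values from the same Series to keep them aligned
payment_counts = df['paymentmethod'].value_counts()
labels = payment_counts.index
values = payment_counts.values
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])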
fig.update_layout(title_text="<b>Payment Method Distribution</b>")
fig.show()
In [32]:
fig = px.histogram(df, x="churn", color="paymentmethod",barmode='group', title="<b>Customer Payment Method distribution w.r.t. Churn</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
- Most customers who churned paid by electronic check.
- Customers who paid by credit card (automatic), bank transfer (automatic), or mailed check were less likely to churn.
In [33]:
df['internetservice'].unique()
Out[33]:
array(['DSL', 'Fiber optic', 'No'], dtype=object)
In [34]:
df[df["gender"]=="Male"][["internetservice", "churn"]].value_counts()
Out[34]:
internetservice  churn
DSL              No       992
Fiber optic      No       910
No               No       717
Fiber optic      Yes      633
DSL              Yes      240
No               Yes       57
Name: count, dtype: int64
In [35]:
df[df["gender"]=="Female"][["internetservice", "churn"]].value_counts()
Out[35]:
internetservice  churn
DSL              No       965
Fiber optic      No       889
No               No       690
Fiber optic      Yes      664
DSL              Yes      219
No               Yes       56
Name: count, dtype: int64
In [36]:
fig = go.Figure()
fig.add_trace(go.Bar(
x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
["Female", "Male", "Female", "Male"]],
y = [965, 992, 219, 240],
name = 'DSL',
))
fig.add_trace(go.Bar(
x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
["Female", "Male", "Female", "Male"]],
y = [889, 910, 664, 633],
name = 'Fiber optic',
))
fig.add_trace(go.Bar(
x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
["Female", "Male", "Female", "Male"]],
y = [690, 717, 56, 57],
name = 'No Internet',
))
fig.update_layout(title_text="<b>Churn Distribution w.r.t. Internet Service and Gender</b>")
fig.show()
- Fiber optic is the most common internet service (3,096 customers) and also has the highest churn rate (about 42%), which might suggest dissatisfaction with this type of internet service.
- DSL has fewer customers (2,416) and a much lower churn rate (about 19%) than fiber optic.
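The churn rate per internet service can be derived from the male and female counts printed above. A minimal sketch, with the counts copied by hand and the column names chosen for illustration:

```python
import pandas as pd

# Counts copied from Out[34] (male) and Out[35] (female), summed per service.
counts = pd.DataFrame(
    {
        "No": [992 + 965, 910 + 889, 717 + 690],
        "Yes": [240 + 219, 633 + 664, 57 + 56],
    },
    index=["DSL", "Fiber optic", "No internet"],
)

# Total customers per service and the share of them that churned.
counts["total"] = counts["No"] + counts["Yes"]
counts["churn_rate_pct"] = (counts["Yes"] / counts["total"] * 100).round(1)
print(counts)
```

This gives roughly 19.0% for DSL, 41.9% for fiber optic, and 7.4% for customers without internet service.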
In [37]:
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="churn", color="dependents", barmode="group", title="<b>Dependents distribution</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
- customers without dependents are more likely to churn
In [38]:
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df, x="churn", color="partner", barmode="group", title="<b>Churn distribution w.r.t. Partners</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
- customers who don't have a partner are more likely to churn
In [39]:
color_map = {"Yes": '#00CC96', "No": '#B6E880'}
fig = px.histogram(df, x="churn", color="seniorcitizen",barmode='group', title="<b>Churn distribution w.r.t. Senior Citizen</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
- senior citizens make up a small share of the customers (1,142 of 7,032)
- almost half of all senior citizens churn
In [40]:
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="churn", color="onlinesecurity", barmode="group", title="<b>Churn w.r.t Online Security</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
- Most customers who churn do not have online security
In [41]:
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df, x="churn", color="paperlessbilling", title="<b>Churn distribution w.r.t. Paperless Billing</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
- Customers with Paperless Billing are more likely to churn.
In [42]:
fig = px.histogram(df, x="churn", color="techsupport",barmode="group", title="<b>Churn distribution w.r.t. TechSupport</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
- Customers who did not get any TechSupport are the most likely to migrate to another service provider.
In [43]:
color_map = {"Yes": '#00CC96', "No": '#B6E880'}
fig = px.histogram(df, x="churn", color="phoneservice", title="<b>Churn distribution w.r.t. Phone Service</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
- few customers lack phone service, and only a small fraction of them churn.
In [44]:
sns.set_context('paper',font_scale=1.1)
# fill=True replaces the deprecated shade=True argument of kdeplot
ax=sns.kdeplot(df.monthlycharges[df.churn=='No'],color='red',fill=True)
ax=sns.kdeplot(df.monthlycharges[df.churn=='Yes'],color='blue',fill=True)
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Monthly Charges');
ax.set_title('Distribution of monthly charges by churn');
In [45]:
df.head()
Out[45]:
| | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
In [46]:
# fill=True replaces the deprecated shade=True argument of kdeplot
ax = sns.kdeplot(df.totalcharges[(df["churn"] == 'No') ],
color="Gold", fill = True);
ax = sns.kdeplot(df.totalcharges[(df["churn"] == 'Yes') ],
ax =ax, color="Green", fill= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Total Charges');
ax.set_title('Distribution of total charges by churn');
In [47]:
fig = px.box(df, x='churn', y = 'tenure')
# Update yaxis properties
fig.update_yaxes(title_text='Tenure (Months)')
# Update xaxis properties
fig.update_xaxes(title_text='Churn')
# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
title_font=dict(size=25, family='Courier'),
title='<b>Tenure vs Churn</b>',
)
fig.show()
- New customers are more likely to churn
In [48]:
plt.figure(figsize=(25, 10))
corr = df.apply(lambda x: pd.factorize(x)[0]).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
ax = sns.heatmap(corr, mask=mask, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, linewidths=.2, cmap='coolwarm', vmin=-1, vmax=1)
data preprocessing¶
we need to convert all the nominal categorical data to numerical data¶
In [49]:
X=df.drop('churn',axis=1)
y=df.churn
In [50]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
In [51]:
col=['gender','seniorcitizen','partner','dependents','phoneservice','multiplelines','internetservice','onlinesecurity','onlinebackup','deviceprotection','techsupport','streamingtv','streamingmovies','contract','paperlessbilling','paymentmethod']
transformer = ColumnTransformer(transformers=[
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=False
('tnf1',OneHotEncoder(sparse_output=False,drop='first'),col)
],remainder='passthrough')
In [52]:
X_train_transformed=transformer.fit_transform(X_train)
X_test_transformed=transformer.transform(X_test)
In [53]:
le=LabelEncoder()
y_train_transformed=le.fit_transform(y_train)
y_test_transformed=le.transform(y_test)
In [54]:
X_train_transformed.shape
Out[54]:
(5625, 30)
In [55]:
X_train.shape
Out[55]:
(5625, 19)
In [56]:
y_train.shape
Out[56]:
(5625,)
In [57]:
y_train_transformed.shape
Out[57]:
(5625,)
In [58]:
le.classes_
Out[58]:
array(['No', 'Yes'], dtype=object)
In [59]:
y_train_transformed
Out[59]:
array([1, 1, 1, ..., 0, 0, 1])
In [60]:
scaler=StandardScaler()
X_train_transformed=scaler.fit_transform(X_train_transformed)
X_test_transformed=scaler.transform(X_test_transformed)
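Because the encoder and scaler above are fit on the whole training set, any later cross-validation on `X_train_transformed` uses preprocessing statistics that already saw each validation fold. One way to keep the folds separate is to wrap the steps in a scikit-learn `Pipeline`, so they are refit inside every fold. The sketch below is a variation rather than the notebook's exact setup: it runs on a small synthetic frame (column values are made up) and scales only the numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): one categorical and two numeric columns.
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
    "tenure": rng.integers(1, 73, size=n),
    "monthlycharges": rng.uniform(18, 119, size=n),
})
toy_y = rng.integers(0, 2, size=n)

# One-hot encode the categorical column and standardise the numeric ones.
preprocess = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(drop="first"), ["contract"]),
    ("num", StandardScaler(), ["tenure", "monthlycharges"]),
])

# The pipeline refits preprocessing on the training part of each fold.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=12)),
])

scores = cross_val_score(model, toy_X, toy_y, cv=5)
print("Mean cv accuracy:", scores.mean())
```

Since the labels here are random, the accuracy is expected to hover around chance; the point is only the structure, which could be reused with `X_train` and `y_train_transformed`.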
model training, prediction and Evaluation¶
KNN¶
In [61]:
knn=KNeighborsClassifier(n_neighbors=12)
cv_score=cross_val_score(knn,X_train_transformed,y_train_transformed,cv=5)
print('cross validation score:', cv_score)
print('Mean cv accuracy:', np.mean(cv_score))
cross validation score: [0.80266667 0.792 0.75733333 0.78222222 0.8 ]
Mean cv accuracy: 0.7868444444444445
In [62]:
knn.fit(X_train_transformed,y_train_transformed)
knn_pred=knn.predict(X_test_transformed)
knn_accuracy=accuracy_score(y_test_transformed,knn_pred)
In [63]:
print(classification_report(y_test_transformed, knn_pred))
precision recall f1-score support
0 0.82 0.88 0.85 1033
1 0.60 0.48 0.53 374
accuracy 0.78 1407
macro avg 0.71 0.68 0.69 1407
weighted avg 0.76 0.78 0.77 1407
Random forest¶
In [64]:
rf=RandomForestClassifier()
rf.fit(X_train_transformed,y_train_transformed)
Out[64]:
RandomForestClassifier()
In [65]:
# make prediction
rf_pred=rf.predict(X_test_transformed)
rf_accuracy=accuracy_score(y_test_transformed,rf_pred)
print(rf_accuracy)
0.7846481876332623
In [66]:
print(classification_report(y_test_transformed,rf_pred))
precision recall f1-score support
0 0.82 0.90 0.86 1033
1 0.63 0.47 0.54 374
accuracy 0.78 1407
macro avg 0.73 0.69 0.70 1407
weighted avg 0.77 0.78 0.77 1407
In [67]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test_transformed,rf_pred),annot=True,fmt='d',linecolor='k',linewidth=1,cmap='copper')
plt.title("Random Forest Confusion Matrix",fontsize=14)
plt.show()
In [68]:
rf_pred_prob = rf.predict_proba(X_test_transformed)[:,1]
fpr_rf, tpr_rf, thresholds = roc_curve(y_test_transformed, rf_pred_prob)
plt.plot([0, 1], [0, 1], 'k--' )
plt.plot(fpr_rf, tpr_rf, label='Random Forest',color = "r")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve',fontsize=16)
plt.show();
In [69]:
# Assume that fpr, tpr, thresholds have already been calculated
optimal_idx = np.argmax(tpr_rf - fpr_rf)
optimal_threshold = thresholds[optimal_idx]
print("Optimal threshold is:", optimal_threshold)
Optimal threshold is: 0.31
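The threshold picked above maximises `tpr - fpr`, which is Youden's J statistic. The sketch below repeats the selection on made-up scores and then applies the threshold to obtain class labels; `y_true` and `scores` are illustrative arrays, not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative labels and predicted probabilities (assumed values).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J: the point on the ROC curve farthest above the diagonal.
j_statistic = tpr - fpr
best_idx = np.argmax(j_statistic)
threshold = thresholds[best_idx]

# Predict the positive class when the score reaches the threshold.
y_pred = (scores >= threshold).astype(int)

print("Threshold:", threshold)
print("Predictions:", y_pred)
```

In the notebook the same comparison would be `rf_pred_prob >= optimal_threshold`. A threshold tuned on the test set gives an optimistic picture, so a separate validation split is the safer place to choose it.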
In [70]:
auc_rf = auc(fpr_rf, tpr_rf)
Gradient Boosting classifier¶
In [71]:
gb = GradientBoostingClassifier()
gb.fit(X_train_transformed, y_train_transformed)
gb_pred = gb.predict(X_test_transformed)
print("Gradient Boosting Classifier", accuracy_score(y_test_transformed, gb_pred))
Gradient Boosting Classifier 0.7896233120113717
In [72]:
print(classification_report(y_test_transformed, gb_pred))
precision recall f1-score support
0 0.83 0.90 0.86 1033
1 0.64 0.48 0.55 374
accuracy 0.79 1407
macro avg 0.73 0.69 0.71 1407
weighted avg 0.78 0.79 0.78 1407
In [73]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test_transformed, gb_pred),
annot=True,fmt = "d",linecolor="k",linewidths=3,cmap='hot')
plt.title("Gradient Boosting Classifier Confusion Matrix",fontsize=14)
plt.show()
In [74]:
gb_pred_prob = gb.predict_proba(X_test_transformed)[:,1]
fpr_gb, tpr_gb, thresholds = roc_curve(y_test_transformed, gb_pred_prob)
plt.plot([0, 1], [0, 1], 'k--' )
plt.plot(fpr_gb, tpr_gb, label='Gradient boosting',color = "r")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Gradient Boosting ROC Curve',fontsize=16)
plt.show();
In [75]:
# Assume that fpr, tpr, thresholds have already been calculated
optimal_idx = np.argmax(tpr_gb - fpr_gb)
optimal_threshold = thresholds[optimal_idx]
print("Optimal threshold is:", optimal_threshold)
Optimal threshold is: 0.28522236073425244
In [76]:
auc_gb = auc(fpr_gb, tpr_gb)
In [77]:
# Plot ROC curves
plt.figure(figsize=(10, 6))
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.2f})')
plt.plot(fpr_gb, tpr_gb, label=f'Gradient Boosting (AUC = {auc_gb:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid()
plt.show()
- Gradient Boosting performs slightly better than Random Forest on the test set: accuracy 0.790 vs 0.785, and F1 for the churn class 0.55 vs 0.54.
Conclusion ¶
Customer churn negatively impacts a firm's profitability.¶
Various strategies can be employed to mitigate customer churn:¶
- Deeply understand customers to prevent churn.
- Identify customers at risk of leaving and enhance their satisfaction.
- Prioritize improvements in customer service.
- Foster customer loyalty through personalized experiences and specialized services.
- Survey customers who have already left to understand their reasons for leaving.
- Adopt a proactive approach to prevent future churn.